Data Analytics Privacy Platform with Quantified Re-Identification Risk

ABSTRACT

The present invention is directed to a differential privacy platform in which the privacy risk of a computation can be objectively and quantitatively calculated. This measurement is performed by simulating a sophisticated privacy attack on the system for various measures of privacy cost or epsilon, and measuring the level of success of the attack. In certain embodiments, a linear program reconstruction-type attack is used. By calculating the loss of privacy resulting from sufficient attacks at a particular epsilon, the platform may calculate a level of risk for a particular use of data. The privacy budget for the use of the data may thereby be set and controlled by the platform to remain below a desired risk threshold.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. provisional patent application No. 63/080,333, filed on Sep. 18, 2021. Such application is incorporated herein by reference in its entirety.

BACKGROUND

Information barriers restrict the use of data analytics. These information barriers may take numerous forms. Data privacy regulations, such as the General Data Protection Regulation (GDPR) in the European Union and the California Consumer Privacy Act (CCPA), restrict the access and movement of personal information. Likewise, organizations may be subject to myriad data confidentiality contractual clauses that restrict the use of data as a condition to having gained access to the data. Migration of data between locally hosted and cloud environments also creates barriers. Various private agreements or best practices limitations may place barriers on the movement of data for confidentiality reasons within an organization.

Some of the most highly protected private information is individual patient medical data. In the United States, such data is protected by, among other legal frameworks, the federal Health Insurance Portability and Accountability Act of 1996 (“HIPAA”) and its implementing regulations. HIPAA provides quite stringent protection for various types of medical data and also provides very significant restrictions on the storage and transfer of this sort of information.

Although the protection of private health data is of course critical, it is also true that analytics performed with respect to medical data is critically important in order to advance medical science and thereby improve the quality of healthcare. The COVID-19 pandemic provides a dramatic example; the ability for researchers to analyze data pertaining to COVID-19 patients and the various treatments provided to these patients has proven to be extremely important in the ability of physicians to provide improved care, leading to better outcomes for patients subjected to this disease.

Under HIPAA, there are two methods to de-identify data so that it may be disclosed. The first is through a “Safe Harbor” whereby eighteen types of identifiers are removed from the data, including, for example, names, telephone numbers, email addresses, social security numbers, and the like. It has been recently shown, however, by a team of researchers including Dr. Latanya Sweeney at Harvard University that this approach is not entirely adequate to protect privacy against all forms of attacks.

The second method to de-identify data under HIPAA is the “Expert Determination” method. This requires that a person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods determine that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information. The “risk-based anonymization” concept within the GDPR is similar to the expert determination method under HIPAA. There is no explicit numerical level of identification risk, however, that is universally deemed to meet the “very small” level indicated by the method.

In general, there are several principles that the expert must consider under the expert determination method. One is replicability, i.e., the risk that the data will consistently occur in relation to an individual. For example, a patient's blood glucose level will vary, and thus has low replicability. On the other hand, a patient's birthdate has high replicability. A second principle is data source availability, that is, how likely it is that the information is available in public or another available source. For example, the results of laboratory reports are not often disclosed with identity information outside of healthcare environments, but name and demographic data often are. A third principle is distinguishability, i.e., how unique the information may be with respect to an individual. For example, the combination of birth year, gender, and 3-digit ZIP code is unique for only a tiny number of US residents, whereas the combination of birth date, gender, and 5-digit ZIP code is unique for over 50% of US residents. A fourth principle is risk assessment, which combines these other principles into an overall analysis. For example, laboratory results may be very distinguishing, but they are rarely disclosed in multiple data sources to which many people have access; on the other hand, demographics are highly distinguishing, highly replicable, and are available in public data sources.

The use of “very small” as the measure of risk under HIPAA is a recognition that the risk of re-identification in a database is never zero. If the data has any utility, then there is always some risk—although it can be so small as to be insignificant—that the data could be re-identified. It is also known that the lower the privacy risk, often the lower the utility of the data will be, because de-identification past a certain point may make the data of little or no use for its intended purpose. In general, a re-identification risk of 50% is said to be at the precipice of re-identification. On the other hand, getting re-identification risk down to 0.05% to 0.10% is generally considered acceptable. The problem, however, is determining the actual re-identification risk in a particular case.

It can be seen therefore that privacy protection by expert determination is highly subjective and variable. It would be desirable to provide a firmer, mathematical basis for determining the level of risk so that risk could be evaluated in an objective manner, both for determining risk in a particular scenario and for comparing the risk created by different scenarios.

Differential privacy is a method of protecting privacy based on the principle that privacy is a property of a computation over a database, as opposed to the syntactic qualities of the database itself. Fundamentally, a computation is considered differentially private if it produces approximately the same result when applied to two databases that differ only by the presence or absence of a single data subject's record. It will be understood that the level of differential privacy with respect to a particular computation will depend greatly upon the data at issue. If, for example, a computation were performed with respect to average income, and an individual named John Doe's income were near the average of the entire data set, then the result would be close to the same and privacy loss would be low whether or not John Doe's data were removed. On the other hand, if John Doe's income were far greater than others in the data set, then the result could be quite different and privacy loss for this computation would be high.
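By way of illustration only, the following Python sketch (using entirely hypothetical income figures) demonstrates the point of the preceding example: an outlying record shifts the result of an average far more than a typical record does, and therefore carries greater privacy loss for that computation.

```python
# Hypothetical incomes; John Doe is the additional record in each scenario.
others = [52_000, 48_000, 51_000, 49_000, 50_000]

typical_doe = 50_500      # near the average of the other records
outlier_doe = 900_000     # far greater than the other records

def avg(values):
    return sum(values) / len(values)

for label, doe in [("typical", typical_doe), ("outlier", outlier_doe)]:
    with_doe = avg(others + [doe])
    without_doe = avg(others)
    # The gap between the two results is what a differentially private
    # mechanism must mask with noise; a large gap means high privacy loss.
    print(f"{label}: with={with_doe:.0f} without={without_doe:.0f} "
          f"difference={abs(with_doe - without_doe):.0f}")
```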

Differential privacy is powerful because of the mathematical and quantifiable guarantees that it provides regarding the re-identifiability of the underlying data. Differential privacy differs from historical approaches because of its ability to quantify the mathematical risk of re-identification. Differential privacy makes it possible to keep track of the cumulative privacy risk to a dataset over many analyses and queries.

As of 2021, over 120 nations have laws governing data security. As a result, compliance with all of these regulatory schemes can seem impossible. However, major data security regulations like GDPR, CCPA, and HIPAA are unified around the concept of data anonymization. The problem is that differential privacy techniques as previously known do not map well to the concepts and anonymization protocols set out in these various privacy laws. For this reason, differential privacy has seen very limited adoption despite its great promise.

Historically, differential privacy research has focused on a theoretical problem in which the attacker has access to all possible information concerning a dataset other than the particular item of data that is sought to be protected. The inventors hereof have recognized, however, that this all-knowledgeable adversary is not a realistic model for determining privacy, and does not correspond to the reasonableness requirements under HIPAA and GDPR as outlined above. These regulations deal with real-world privacy situations, not highly theoretical situations where the attacker has all possible information.

In addition, to bring differential privacy into practical applications and within the framework of existing privacy regulations, there is a need to determine what level of epsilon (i.e., the privacy “cost” of a query) provides reasonable protection. Current work regarding differential privacy simply selects a particular epsilon and applies it, without providing any support for why that epsilon was chosen or why that particular choice of epsilon provides sufficient protection under any of the various privacy regulations such as HIPAA and GDPR. In order to use differential privacy in practical applications under these existing legal frameworks, a method of quantifying privacy risk that fits within such frameworks is needed.

SUMMARY

The present invention is directed to a differential privacy platform in which the privacy risk of a computation can be objectively and quantitatively calculated. This measurement is performed by simulating a sophisticated privacy attack on the system for various measures of privacy cost or epsilon, and measuring the level of success of the attack. In certain embodiments, a linear program reconstruction attack is used, as an example of one of the most sophisticated types of privacy attacks. By calculating the loss of privacy resulting from attacks at a particular epsilon, the platform may calculate a level of risk for a particular use of data. The privacy “budget” for the use of the data may thereby be set and controlled by the platform to remain below a desired risk threshold. By maintaining the privacy risk below a known threshold, the platform provides compliance with applicable privacy regulations.

In various embodiments, the present invention uses differential privacy in order to protect the confidentiality of individual patient data. In order to protect patient privacy while at the same time being able to make the most high-value use of the data, the invention in various embodiments provides an objective, quantifiable measure of the re-identification risk associated with any particular use of data, thereby ensuring that no significant risk of re-identification will occur within a proposed data analysis scenario.

In various embodiments, no raw data is exposed or moved outside its original location, thereby providing compliance with data privacy and localization laws and regulations. In some embodiments, the platform can anonymize verified models for privacy and compliance, and users can export and deploy secure models outside the original data location.

In some embodiments, a computing platform can generate differentially private synthetic data representative of the underlying dataset. This can enable data scientists and engineers to build data prep, data cleaning, and feature pipelines without ever seeing raw data, thereby protecting privacy.

In some embodiments, familiar libraries and frameworks such as SQL can be used by data scientists to define machine learning models and queries. Users can engage with a platform according to certain embodiments by submitting simple commands using a specific API.

The invention in certain embodiments uses a metric for assessing the privacy risk from intentional attack, wherein the probability of a successful privacy attack Pr(success) is equal to the likelihood of success if an attack is made Pr(success|attempt) multiplied by the probability of an attack Pr(attempt). The invention then provides an adversary model as described above that presents the most significant risk of attempted privacy attack, summarizes the mitigating controls in the consortium, and presents the determined Pr(attempt) given consideration of these factors. Industry best practices provide a reference point for deriving the Pr(attempt) value. If there are strong security protocols in the system, including multi-factor authentication, HTTPS, etc., Pr(attempt) is typically set between 0.1 and 0.25. The strong privacy attack is used to calculate Pr(success|attempt). With these two values then known, Pr(success) may be computed as their product.

In certain embodiments, the invention may employ caching to return the same noisy result in a differential privacy implementation regardless of the number of times the same query is submitted by a particular user. This caching may be used to thwart certain types of privacy attacks that attempt to filter out the noise by averaging repeated results.
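A minimal sketch of such caching is set out below in Python; the cache key, noise scale, and query-engine class shown are illustrative assumptions rather than an actual implementation. Because an identical query from the same user always returns the identical noisy answer, repeated submissions yield no additional noisy samples to average.

```python
import hashlib
import numpy as np

class CachedNoisyQueryEngine:
    """Illustrative sketch: return the same noisy result for repeated queries."""

    def __init__(self, scale):
        self.scale = scale   # noise scale (e.g., Laplace b), an assumed parameter
        self.cache = {}      # (user, query hash) -> previously returned noisy value

    def run(self, user, sql, true_result):
        key = (user, hashlib.sha256(sql.encode()).hexdigest())
        if key not in self.cache:
            # Noise is drawn only the first time this user submits this query.
            noise = np.random.laplace(loc=0.0, scale=self.scale)
            self.cache[key] = true_result + noise
        return self.cache[key]

engine = CachedNoisyQueryEngine(scale=2.0)
q = "SELECT AVG(length_of_stay) FROM encounters"
# Repeated submissions return the identical noisy value, defeating averaging attacks.
results = {engine.run("analyst_1", q, true_result=5.4) for _ in range(100)}
assert len(results) == 1
```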

In various embodiments, the platform deploys differential privacy as an enterprise-scale distributed system. Enterprises may have hundreds, thousands, and even tens of thousands of data stores, but the platform provides a unified data layer that allows analysts to interact with data regardless of where or how it is stored. The platform provides a privacy ledger to guarantee mathematical privacy across all connected datasets through a unified interface. The platform also has a rich authorization layer that enables permissioning based on user and data attributes. The platform makes it possible to control who is able to run queries and with what type of privacy budget.

These and other features, objects and advantages of the present invention will become better understood from a consideration of the following detailed description of the preferred embodiments and appended claims in conjunction with the drawings as described following:

DRAWINGS

FIG. 1 is a chart depicting levels of risk for re-identification of data based on data protection scheme.

FIG. 2 is a diagram illustrating differential privacy concepts.

FIG. 3 is a flowchart for a system according to an embodiment of the present invention.

FIG. 4 is a swim lane diagram for a system according to an embodiment of the present invention.

FIG. 5 is a swim lane diagram for external researcher SQL queries according to an embodiment of the present invention.

FIG. 6 is a swim lane diagram for internal researcher SQL queries according to an embodiment of the present invention.

FIG. 7 is a swim lane diagram for external researcher machine learning training or evaluation according to an embodiment of the present invention.

FIG. 8 is a swim lane diagram for internal researcher machine learning training or evaluation according to an embodiment of the present invention.

FIG. 9 is a swim lane diagram for external researcher synthetic data queries according to an embodiment of the present invention.

FIG. 10 is a swim lane diagram for internal researcher raw data queries according to an embodiment of the present invention.

FIG. 11 is a high-level architectural diagram of a data environment according to an embodiment of the present invention.

FIG. 12 illustrates an example SQL query and results that would expose raw data if allowed, according to an embodiment of the present invention.

FIG. 13 illustrates an example SQL query for average length of stay according to an embodiment of the present invention.

FIG. 14 illustrates an example SQL query in an attempt to manipulate the system to expose private information, according to an embodiment of the present invention.

FIG. 15 illustrates an example SQL query with added noise, according to an embodiment of the present invention.

FIG. 16 illustrates an example SQL query attempt to discern private data with quasi-identifiers, according to an embodiment of the present invention.

FIG. 17 illustrates an example SQL query for count of patients binned by date range of death, according to an embodiment of the present invention.

FIG. 18 illustrates an example query to create a synthetic dataset, according to an embodiment of the present invention.

FIG. 19 illustrates an example query to perform machine learning analysis, according to an embodiment of the present invention.

FIG. 20 is a chart providing a summary of exemplary settings for a data analytics platform according to an embodiment of the present invention.

FIG. 21 is a graphic providing an example of queries expending a query epsilon budget, according to an embodiment of the present invention.

FIG. 22 illustrates an SQL query according to a prior art system to execute a successful differencing attack on a database.

FIG. 23 illustrates the SQL query of FIG. 22 being defeated by differential privacy, according to an embodiment of the present invention.

FIG. 24 is a chart illustrating the results of a privacy attack at varying values of per-query epsilon, according to an embodiment of the present invention.

FIG. 25 is a graph plotting the results of a privacy attack at varying values of per-query epsilon, according to an embodiment of the present invention.

FIG. 26 illustrates an SQL query according to a prior art system to execute a successful averaging attack on a database.

FIG. 27 is a chart illustrating results of the averaging attack of FIG. 26.

FIG. 28 is a chart illustrating the results of the averaging attack of FIGS. 26 and 27 against an embodiment of the present invention with caching.

FIG. 29 is a graphic illustrating an exemplary linear programming reconstruction attack against a database.

FIG. 30 shows the results of a reconstruction attack against an embodiment of the present invention, at varying levels of total epsilon.

FIG. 31 is a density chart showing the results of a reconstruction attack against an embodiment of the present invention, at varying levels of total epsilon.

FIG. 32 is a chart showing parameters for exemplary reconstruction attacks against an embodiment of the present invention.

FIG. 33 is a chart showing the results of an attribute inference attack against an embodiment of the present invention.

FIG. 34 is a chart showing the disclosure risk arising from synthetic datasets, according to an embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT(S)

Before the present invention is described in further detail, it should be understood that the invention is not limited to the particular embodiments described, and that the terms used in describing the particular embodiments are for the purpose of describing those particular embodiments only, and are not intended to be limiting, since the scope of the present invention will be limited only by the claims. In particular, while the invention in various embodiments is described with respect to the use of protected health information in various scenarios, the invention is not so limited, and may be employed in alternative embodiments with respect to any type of data where data privacy is to be safeguarded.

Re-identification risk with data, including personal health data, may be broadly divided into five levels or categories, as shown in FIG. 1. In this example, risk will be discussed with respect to medical data protected by HIPAA in the United States, although the invention is not so limited. Data at higher levels has less risk of re-identification, but requires more effort, cost, skill, and time to re-identify to that level.

At level one is readily identifiable data, that is, the raw data that contains personal health information. In other words, this is fully protected health information (PHI) with the identifiers preserved. This type of data can be used to achieve the highest analytics utility to researchers, but its use also represents the greatest disclosure risk for patients.

At level two is masked data. This data still contains personal data, but it has been masked in some manner. For example, there is some transformation of personal information such as placing data into bands or ranges. This can include age ranges or larger geographic areas, such as by ZIP codes. Masked data can also include data where such things as demographics or other identifiers are simply removed. Both the Safe Harbor and Limited Data Set provisions of the HIPAA law and regulations produce masked data. Masking techniques may be either reversible or irreversible. Regardless, this data still includes indirect identifiers that create re-identification risk.

At level three is exposed data. This is data that has privacy transformations applied, but it lacks a rigorous analysis of re-identification risk. The risk associated with disclosure of this data is difficult to quantify.

At level four is managed data. This is data that has verifiable claims made concerning risk assessment based on rigorous methodology. Managed data may be identifiable above or below a certain threshold of privacy protection. Above this threshold the data may still be considered to contain personal information, but below the threshold it may be considered to not contain personal information.

At the highest level, level 5, is data that appears only in aggregated form, that is, by combining data about multiple data subjects, which contains no personal information. For example, the mortality rate of a cohort of patients is an aggregate of the cohort's count of individual deaths divided by the total cohort size. Aggregate data that is stratified by quasi-identifiers can be re-identified through privacy attacks, so in these cases rigorous analysis must be performed to determine if the data can be considered aggregate data. True aggregate data presents no privacy risk as it cannot be re-identified by anyone.

One can apply an additional grouping to the data types in this tier-based model. Levels 1-4 all represent “person-level” data, wherein each row in the dataset represents information about an individual. Level 5 is unique in these tiers insofar as it always represents information about a group of individuals, and thus is not considered PHI.

Aggregate data is not personal health information and as such does not require de-identification risk management. It is therefore adequately de-identified by definition. However, care must be taken when presenting aggregate data for research use to ensure that the data presented is in fact aggregate data and cannot be re-formatted or manipulated to expose individual health information. For example, if quasi-identifiers stratify aggregate information, one must employ privacy mechanisms to verify that the aggregate data cannot be used to re-identify individuals. In certain embodiments of the invention, all statistical results of queries are aggregate data, but because users can query for results that are stratified by quasi-identifiers, the results do present privacy risk to the system. The invention, however, provides a means to explicitly evaluate this privacy risk, as explained below.

Differential privacy is based on a definition of privacy that contends privacy is a property of the computation over a database, as opposed to the syntactic qualities of the database itself. Generally speaking, it holds that a computation is differentially private if it produces approximately the same result when applied to two databases that differ only by the presence or absence of a single data subject's record. FIG. 2 provides an illustration of differential privacy with respect to the data for a person named John Doe 10. The computation is differentially private if and only if it produces approximately the same query result 16 when applied to two databases 12 and 14 that differ only by the presence or absence of a single data subject's record.

This definition of differential privacy can be described formally via a mathematical definition. Take, for example, a database D that is a collection of data elements drawn from the universe U. A row in a database corresponds to an individual whose privacy needs to be protected. Each data row consists of a set of attributes A = A₁, A₂, . . . , Aₘ. The set of values each attribute can take, i.e., their attribute domain, is denoted by dom(Aᵢ), where 1≤i≤m. A mechanism M: D→Rᵈ is a randomized function that maps database D to a probability distribution over some range and returns a vector of randomly chosen real numbers within the range. A mechanism M is said to be (ε, δ)-differentially private if adding or removing a single data item in a database only affects the probability of any outcome within a small multiplicative factor, exp(ε), with the exception of a set on which the densities exceed that bound by a total of no more than δ.

Sensitivity of a query function ƒ represents the largest change in the output of the query function that can be made by a single data item. The sensitivity of function ƒ, denoted Δƒ, is defined by:

Δƒ=max|ƒ(x)−ƒ(y)|,

where the maximum is over all pairs of datasets x and y, differing by at most one data subject.
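As a worked illustration of this definition (the query functions and bounds below are assumed for purposes of example), the sensitivity of a counting query is 1, while the sensitivity of a sum over values clamped to a range [0, B] is B:

```python
# Illustrative sensitivity calculations under the definition above.
# For a counting query f(D) = |D|, one added or removed subject changes
# the output by at most 1, so delta_f = 1.
delta_f_count = 1

# For a sum over a value clamped to [0, B] (e.g., length of stay capped
# at 30 days), one subject can change the sum by at most B.
B = 30
delta_f_bounded_sum = B

# For a mean over a dataset of known minimum size n with values in [0, B],
# a single subject can shift the mean by at most about B / n (a common bound).
n = 1000
delta_f_bounded_mean = B / n

print(delta_f_count, delta_f_bounded_sum, delta_f_bounded_mean)
```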

A differentially private mechanism can be implemented by introducing noise sampled from a Gaussian distribution. Specifically, the Gaussian mechanism adds noise sampled from a Gaussian distribution where the variance is selected according to the sensitivity, Δƒ, and privacy parameters, ε and δ:

Gauss(x, ƒ, ε, δ) = ƒ(x) + N(μ = 0, σ² = 2 ln(1.25/δ)·(Δƒ)²/ε²)
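The following Python sketch implements the Gaussian mechanism directly from the formula above; the example query value, sensitivity, and parameter choices are assumptions for illustration only.

```python
import numpy as np

def gaussian_mechanism(true_value, sensitivity, epsilon, delta):
    """Return f(x) plus Gaussian noise calibrated per the formula above."""
    sigma_squared = 2.0 * np.log(1.25 / delta) * (sensitivity ** 2) / (epsilon ** 2)
    noise = np.random.normal(loc=0.0, scale=np.sqrt(sigma_squared))
    return true_value + noise

# Example: protect an average-age result of 58.3 with an assumed
# sensitivity of 1 and privacy parameters epsilon = 1, delta = 1e-5.
noisy_age = gaussian_mechanism(58.3, sensitivity=1.0, epsilon=1.0, delta=1e-5)
print(noisy_age)
```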

A key observation of differential privacy mechanisms is that the variance of the distributions from which noise is sampled is determined by Δƒ, ε, and δ. Importantly, this is different from other perturbation methods that sample from the same distribution for all noise perturbations. Bounded noise methods are vulnerable to attacks that decipher the noise parameters via differencing and averaging attacks, and then exploit this information to dynamically remove the noise and use the accurate values to re-identify individuals and/or reconstruct a database. For the same differentially private mechanism applied with identical privacy parameters ε and δ to two subsets of a database, D₁, D₂⊂D, the variance σ² will be proportional to the sensitivity of the function, Δƒ, as calculated respectively for D₁ and D₂. This property increases the level of skill, time, and resources required to decipher private information from differentially private results as compared to statistical results released under bounded noise approaches.

Each differentially private query, q, executed by users of the system is executed with user-configurable parameters such that each query submitted can be represented as q_(ε,δ). Each of these queries consumes a privacy budget with values ε, δ. The higher a query's ε, δ parameters, the more accurate the results will be but the lower the privacy guarantees will be, and vice versa. Furthermore, each query will reduce a dataset's total budget by its configured ε, δ values.

As mentioned above, differential privacy is the only formal definition of privacy, and it is widely accepted as a rigorous definition of privacy in the research literature. Its use in practice, however, has been limited due to complications stemming, in part, from choosing the appropriate ε and δ parameters for the privacy budget. There is no formal model for selecting the appropriate ε and δ parameters in a privacy system. Selecting values that are too low will degrade analytics utility to the point that a system cannot serve its intended function, while selecting values too high can lead to catastrophic privacy loss.

The problem of setting the privacy budget has received less attention in the research literature than differential privacy mechanisms themselves. Different approaches have been proposed, such as an economic method and empirical approaches. Literature on information privacy for health data suggests some systems have budgets as high as 200,000, while differential privacy practitioners have called for values as low as less than 1 and as much as 100¹⁰. These wide ranges provide no means for applying differential privacy while operating in a real-world environment under an applicable legal regulation such as HIPAA.

Before describing the structure and operation of a platform for providing access to sensitive data with quantified privacy risk according to certain embodiments of the invention, the function of the system within an overall ecosystem of medical research and publication may be described. A number of health providers have developed extremely large, high-fidelity, and high-dimensional patient datasets that would be of great value for medical research. If health providers could form consortiums to share their data for medical research while still complying with privacy requirements, they could leverage this data to produce even greater returns on their medical research and improved patient outcomes. A typical research consortium according to certain embodiments of the present invention is composed of four types of entities: research institutions, medical research journals, data providers (such as healthcare providers), and a data analytics platform provider. In this arrangement, the research institutions enter into agreements to collaborate to answer important research questions. The medical research journals receive papers describing the results of this research. The medical journals may also be provided with data to corroborate the underlying data that supports this research, in order to avoid the problem of falsified data. The data providers provide both their data assets and potentially also their research personnel. The data analytics platform provider uses its platform to de-identify data within a secure environment for researchers within the consortium to perform analysis without being exposed to protected health information (PHI). The overall system thus protects patient privacy while allowing for the most advantageous use of health information from the data provider.

Referring now to FIG. 3, a basic workflow within the consortium may be described according to an embodiment of the invention. At step 30, principal researchers at the research institutions propose research studies. The research institutions have a data use agreement established with the data provider for this purpose.

At step 32, a central institutional review board reviews the proposed studies and either approves or denies them. If they are approved, the board designates the appropriate access tier for researchers involved in the study. All of the research will be performed in a secure cloud infrastructure powered by the data analytics platform, as described herein.

At step 34, analysts (in certain embodiments, being researchers) use the data analytics platform to conduct research that is compliant with all privacy requirements through their own computer systems networked to the data analytics platform. The data is de-identified before it reaches the researchers. Access to the data is provisioned according to two different roles: internal researchers, who can access PHI, and external researchers, who can only receive de-identified information. Internal researchers are those researchers associated with the data provider, and are given access complying with HIPAA's “Limited Data Set” standards. For example, patient data available to internal researchers may include unrestricted patient age and five-digit ZIP codes. For external users, three types of usage patterns are available. The first is SQL queries through an interactive API, which provides limited access to the secure data in the cloud infrastructure. Results from this API call are perturbed with noise to preserve privacy using differential privacy techniques. A second usage pattern is machine learning. A Keras-based machine learning API is available, which enables fitting and evaluation of machine learning models without ever removing data from the secure cloud infrastructure. A third usage pattern is synthetic data. This data is artificial, but is statistically comparable to and computationally derivative of the original data. Synthetic data contains no PHI. Only one or two of these functionalities may be available in alternative embodiments of the invention.

It may be noted that the expert determination under HIPAA described herein deals solely with the de-identification methods applied to the external researcher permission tier. HIPAA compliance of data usage by research institutions operating within the internal researcher tier may be determined based on the data used and the agreements in place between the data provider and the research institutions. However, there are points within the research workflows in which external researchers and internal researchers may collaborate. Controls are put in place to ensure that external researchers do not access protected health information during the process.

Once they have been assigned an appropriate data access tier, researchers will use the data analytics platform to perform research and analysis on the data provider's dataset. All data access takes place through an enclave data environment that protects the privacy and security of the data provider's data. Firewalls and other appropriate hardware are employed for this purpose. The data analytics platform is a computing framework that preserves privacy throughout the data science lifecycle.

Before researchers are granted access to the data analytics platform, the system will be installed and configured within the data provider's cloud infrastructure. This setup process includes privacy parameter configuration, which is discussed below. Another phase of the system setup is the configuration of security safeguards to ensure that only authorized users are granted access to the system. Protection of data involves both security and privacy protections; although the focus of the discussion herein is privacy, the system also may use various security mechanisms such as multi-factor authentication to provide security.

External researchers are able to create SQL views and execute SQL queries through the data analytics platform API. Queries executed through the SQL API return approximate results protected by differential privacy, meaning that a controlled amount of noise is injected into the results to protect the privacy of the data subjects. Researchers can nevertheless use these approximate results to explore the data provider's dataset and develop research hypotheses. Once a researcher has settled on a hypothesis and requires exact values to be returned, the researcher sends the analysis to an internal researcher. This internal researcher can run the analysis and retrieve exact results, and use those results to provide insights to the external researcher. The only results that internal researchers can provide to external researchers are aggregate statistical results. The internal researcher is responsible for certifying that the information sent to the external researcher does not compromise the privacy of any individual patient. Because multiple seemingly innocuous queries can be used together to uncover sensitive information, the internal researcher must be aware of the context and purpose of the analysis that such an external researcher performs.

At step 36, the papers are written and reviewed for disclosure prior to publication. This review is to ensure that no PHI has inadvertently been disclosed in the paper. At step 38 the papers are submitted to one or more of the medical research journals for publication. These may be accessed through specific resource centers for particular health concerns.

A workflow for the process just described is provided in the swim lane diagram of FIG. 4. At step 40, the external researcher submits a study from its computer system and applies to the internal review board for access to the data. At step 41, the board reviews the proposal and either approves or denies the proposal. If the proposal is approved, then processing moves to steps 42 and 43, where the external researcher and internal researcher, respectively, are given access to the data analytics platform for the purpose of the study. The external researcher executes queries and generates a hypothesis on the noisy returned data at step 44. At step 45, the external researcher contacts the internal researcher for testing of the hypothesis against the non-noisy (i.e., raw) version of the data. The internal researcher at step 46 evaluates the hypothesis as requested, and at step 47 determines what results may be returned to the external researcher while maintaining appropriate privacy safeguards. At step 48, the internal researcher returns the non-disclosive, aggregated statistical results of the evaluation of the hypothesis to the external researcher, and at step 49 they may jointly review the results. If the results are interesting or important, such as a confirmation of the external researcher's hypothesis, then at step 50 the researchers select data to include for publication. This data is sent to the board for review at step 51, and if approved the researchers then draft a manuscript for publication at step 52. The researchers submit the manuscript for publication to the medical journal(s) at step 53, and then the general public gains access to the article upon publication at step 54. The article may include aggregate results, because such results do not disclose PHI.

FIG. 5 details the workflow for an SQL API query for the external researcher. At step 60, the external researcher sends the SQL query through the API concerning the dataset of the data provider. At step 61, the data analytics platform executes the query against the raw dataset, but injects noise into the results as part of the differential privacy scheme. At step 62 the noisy results are returned to the external researcher, and at step 63 the external researcher receives the noisy results.

FIG. 6 details the workflow for an SQL API query for the internal researcher. Similar to FIG. 5, the query is received at step 70 through the API. But in this case, the data analytics platform executes the query against the raw dataset without injecting noise at step 71. The true results are returned at step 72, and then at step 73 the internal researcher receives the true query results through the API.

As noted above, external researchers can train and evaluate machine learning models on SQL views within the data analytics platform. These models are defined through the Keras API, and are trained and evaluated remotely on the data provider's clinical dataset. The models themselves are not returned to researchers, and evaluation can only be performed using data that exists within the secure cloud infrastructure environment. FIG. 7 provides a flow for this processing. At step 80, the external researcher requests machine learning training or evaluation through the corresponding API. At step 81 the data analytics platform ingests raw data and executes the requested machine learning task. At step 82 the platform returns the status and/or the summary statistics, as applicable, to the researcher. At step 83 the external researcher receives the status and/or summary statistics through the API.

Internal researchers may also access and export models created by external researchers, but these models have the propensity to memorize characteristics of the data they are trained on and therefore are treated within the system as though they are private data. FIG. 8 provides a detailed flow for this processing. At step 90, the internal researcher requests training or evaluation through the appropriate API. At step 91, the data analytics platform ingests raw data and executes the requested machine learning task. At step 92 the platform returns the status and/or the summary statistics, as applicable, to the internal researcher. The internal researcher receives the status and/or summary statistics at step 93. The internal researcher may then request a trained machine learning model through the API at step 94, and the data analytics platform retrieves and returns the trained machine learning model in response to this request at step 95. The internal researcher then receives the trained machine learning model through the API at step 96.

Again as noted above, synthetic data may be used in this processing as well. External researchers may create and export synthetic datasets generated from SQL views based on the real dataset. These synthetic datasets retain the univariate and some of the multivariate statistical characteristics of the original dataset they are based on, so they can be used to generate research hypotheses. For example, an external researcher can use a synthetic dataset to prepare a script that will run a regression or hypothesis test. FIG. 9 provides a flow for this processing. At step 100, the external researcher requests a synthetic dataset through the corresponding API. At step 101, the data analytics platform generates the synthetic dataset, and at step 102 the data analytics platform evaluates the privacy of the synthetic dataset before its release. If the synthetic dataset is sufficiently private, then the data analytics platform releases the synthetic dataset, which is received by the external researcher through the corresponding API at step 103.

An internal researcher may also use synthetic datasets. Just as in the SQL workflow, the internal researcher can run the analysis to retrieve exact results, and use those results to provide insights to the external researcher. As noted above, the only information an internal researcher can send back to an external researcher is aggregate statistical results. This flow is shown in FIG. 10. At step 110, the internal researcher requests the raw dataset through the corresponding API. The data analytics platform retrieves the raw dataset at step 111, and then the internal researcher receives the raw data from the data analytics platform through the corresponding API at step 112.

Now that the description of this overall system is complete, the systems and methods by which de-identification is performed within the data provider's dataset in accordance with HIPAA or other applicable privacy rules may be described in greater detail. In the examples that will be provided herein, the data provider's dataset is a relational database of inpatient and intensive care unit (ICU) encounters that preserves various data fields that are not compliant with HIPAA Safe Harbor, such as, for example, year of birth (or age) and dates of service. These fields are preserved to enable epidemiological research studies to be conducted on the data. Due to the presence of identifying and quasi-identifying fields, the data must be de-identified via expert determination under HIPAA rules. Expert determination relies on the application of statistical or scientific principles that result in only a “very small” risk that an individual could be identified.

The de-identification systems and methods described herein operate according to three core principles. The first principle is that there is no row-level access to PHI. This means that analysts are never exposed to PHI. The entire analytics lifecycle—from data transformation to statistical analysis—is supported without revealing row-level protected health information.

The second principle is the use of noise to prevent unique disclosure. Aggregate data are produced using rigorous statistical privacy techniques to reduce the risk of revealing sensitive information. Differential privacy, as described herein, underpins this capability.

The third principle is that PHI remains secured by enforcing policies for fine-grained authorization that grant access to analysts without releasing data publicly or ever moving data outside of the data provider's firewall.

The de-identification system and methods will be described first by providing a high-level summary of the privacy mechanisms employed in the data analytics platform. The second section will describe a summary of the privacy-relevant considerations and the chosen parameters in order to achieve the HIPAA-compliant “very small” risk that an individual could be re-identified. The third section provides a quantitative evaluation of the privacy risk represented by the system and methods.

The data analytics platform is implemented as a cluster-computing framework that enables analysts and data scientists to perform data transformation, feature engineering, exploratory data analysis, and machine learning, all while maintaining the privacy of patients in the underlying data. All data access in the data analytics platform takes place through an enclave data environment that protects the privacy and security of the data provider's data. The data platform provides controls to ensure no data can leave the data provider's cloud environment, where the data is hosted. FIG. 11 provides a high-level diagram of this environment. Data analytics platform 114 and the data provider dataset 116 both lie within the enclave data environment 118. Analyst 119 (operating through a remote computing device connected over a network such as the Internet to enclave data environment 118) may access data provider dataset 116 only through data analytics platform 114, thereby ensuring that the analysts (e.g., external researchers) never see row-level data within data provider dataset 116.

Data analytics platform 114 executes three core analytic functions: executing SQL queries through the SQL API; developing machine learning models through the machine learning API; and generating synthetic datasets through the synthetic dataset API. These three functions each have safeguards to protect the privacy of patients whose data lies in the data provider dataset 116. A typical de-identification assessment would likely require an attribute-level privacy assessment, but in the case of data analytics platform 114, privacy is enforced by the system's mechanisms equally across all attributes. Hence, the privacy controls described herein remain effective even if additional attributes, such as for example ZIP code fields, are added to the data provider dataset 116.

Researchers interact with data analytics platform 114 much as they would interact directly with a database. However, in this case the data remains within the data provider's cloud infrastructure environment, i.e., enclave data environment 118. Standard SQL syntax, which is familiar to many researchers, may be used for the necessary API calls. FIG. 12 provides an example of an SQL query that would expose raw data from data provider dataset 116 if it were allowed; as shown in FIG. 12, however, an error results because this type of query is denied by data analytics platform 114. Fulfilling this “*” query would result in a dump of all patient information, thereby causing a catastrophic loss of privacy.

Though certain query restrictions are imposed to protect patient privacy, the data analytics platform 114 supports aggregate statistical queries in standard SQL syntax. FIG. 13 demonstrates how an analyst can use such a command to query a table for the average length of stay across patient encounters in the dataset. This type of query is allowed, but with noise added to the results as explained more fully below.

Certain aggregate queries might be manipulated to expose private information. For example, a user could attempt to single out a specific patient via an identifier. FIG. 14 demonstrates a query that would single out information about a single patient, and shows how data analytics platform 114 prevents this operation.

The simple protections of the type shown in FIG. 14 guard against the majority of malicious attempts on the system to expose private information. However, a nefarious user that is motivated to retrieve sensitive information from a dataset can launch more sophisticated privacy attacks against the system to attempt exfiltration of sensitive, private data. For this reason, data analytics platform 114 employs an additional layer of protection based on differential privacy. This adds noise to the output of statistical queries. The noise is added in a controlled manner to minimize the impact on analytics quality, as illustrated in FIG. 15. In this case, noise is added to the true average age of 58.3, and the value returned is 60.1. This data is still useful to an external researcher in order to form a hypothesis, but the noise defeats many types of privacy attacks attempting to re-identify data in data provider dataset 116.
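A hypothetical interaction of this kind is sketched below in Python. The client class and its methods are illustrative assumptions rather than an actual SDK; the sketch simulates the platform's behavior of executing ordinary SQL against the raw data inside the enclave and returning only a noise-perturbed aggregate.

```python
import numpy as np

class PlatformSQLClient:
    """Hypothetical client for the platform's SQL API (names are illustrative)."""

    def __init__(self, endpoint):
        self.endpoint = endpoint

    def query(self, sql, epsilon=1.0, sensitivity=1.0):
        # In a real deployment this call would go to the enclave; here the
        # platform's behavior is simulated: execute on raw data, add noise.
        true_result = self._execute_on_raw_data(sql)
        noise = np.random.laplace(scale=sensitivity / epsilon)
        return true_result + noise

    def _execute_on_raw_data(self, sql):
        # Stand-in for the platform executing the SQL inside the enclave.
        return 58.3  # e.g., the true average age for the illustrative query

client = PlatformSQLClient("https://enclave.example.org/sql")
noisy_avg = client.query("SELECT AVG(age) FROM encounters", epsilon=1.0)
print(noisy_avg)   # e.g., approximately 60.1 rather than the true 58.3
```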

Certain queries provide result sets that stratify statistical results by quasi-identifiers, and this information can result in unique disclosures of protected health information. For example, a query that returns the number of deaths by day could uniquely disclose private information. The data analytics platform 114 dynamically prevents disclosive results from being returned to the analyst. An example of this is shown in FIG. 16, where an attempt is made to return a count by date of death. As shown in FIG. 16, an error is returned and no data is obtained.

Analysts can manipulate queries such as that shown in FIG. 16 to return a similar type of information but with lower-fidelity results. For example, on the same dataset, an analyst could query for the number of deaths per week (instead of per day), and the differential privacy mechanism will dynamically calculate whether the results would be disclosive. If the binned values are not disclosive, they will be returned to the analyst and each week will have a carefully calculated amount of noise added to maximize statistical utility for the analyst, while protecting the privacy of the data subjects in the database. The method for determining whether the binned values are disclosive depends upon the privacy budget, as explained below.
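One simple way such a dynamic check could operate is sketched below in Python; the suppression threshold and noise scale are assumptions for illustration, and the platform's actual determination, as noted above, depends on the privacy budget.

```python
import numpy as np

def weekly_death_counts(daily_counts, epsilon_per_bin=0.5, suppress_below=10):
    """Illustrative sketch: bin daily counts into weeks, add noise, and
    suppress bins judged potentially disclosive (threshold is assumed)."""
    results = {}
    for week_start in range(0, len(daily_counts), 7):
        true_count = sum(daily_counts[week_start:week_start + 7])
        # Counting queries have sensitivity 1; Laplace noise of scale 1/epsilon.
        noisy = true_count + np.random.laplace(scale=1.0 / epsilon_per_bin)
        if noisy < suppress_below:
            results[week_start] = None      # withheld as potentially disclosive
        else:
            results[week_start] = round(noisy)
    return results

# Four weeks of hypothetical daily death counts.
daily = [3, 2, 4, 1, 0, 2, 3,
         5, 4, 6, 3, 2, 4, 5,
         0, 1, 0, 0, 1, 0, 0,
         7, 6, 5, 8, 4, 6, 7]
print(weekly_death_counts(daily))
```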

Differential privacy protects datasets by providing mathematical guarantees about the maximum impact that a single individual can have on the output of a process, such as a query. The system is designed around the technology of differential privacy while not adhering strictly to the formal definition required for strong adversaries (e.g., theoretical adversaries in possession of all possible useful information). Specifically, data analytics platform 114 employs empirical risk assessments of the system as opposed to utilizing the theoretical risk values produced by the privacy mechanisms of academic researchers. While this does forfeit the theoretical mathematical guarantees associated with pure implementations of differential privacy, this implementation is quantitatively shown to adequately protect not just anonymity, but also confidentiality. An example query in this category is shown in FIG. 17. Here, counts of dates of death are binned by week.

As noted above, another core function of data analytics platform 114 is to enable synthetic data generation. This capability allows users to generate an artificial dataset that maintains similarities to the PHI dataset from which it was generated. An example query to create a synthetic dataset is provided in FIG. 18. The synthetic data is generated using a machine learning model, and the synthesized data is evaluated for privacy risk before being returned to the user. The synthetic data looks structurally similar to the data from which it is generated and can be used by any data analytics toolkit, such as, for example, the Python programming language's scientific computing libraries or the R programming language and its many libraries for data analysis.
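By way of illustration, once a synthetic dataset has been exported it can be analyzed with ordinary tooling. The brief Python sketch below (the file name and column names are assumed) loads such a dataset and fits a simple regression in order to develop a hypothesis; the resulting script could later be handed to an internal researcher to run against the real data, per the workflow described above.

```python
import pandas as pd
import statsmodels.api as sm   # standard Python statistics library

# In practice the synthetic dataset would be exported by the platform, e.g.:
#   synthetic = pd.read_csv("synthetic_encounters.csv")   # hypothetical file name
# A tiny inline stand-in (columns assumed) keeps this sketch self-contained.
synthetic = pd.DataFrame({
    "age":            [34, 51, 67, 45, 72, 58, 29, 80],
    "length_of_stay": [2,   4,  7,  3,  9,  5,  1, 11],
})

# Develop a hypothesis on synthetic data: does age predict length of stay?
X = sm.add_constant(synthetic[["age"]])
model = sm.OLS(synthetic["length_of_stay"], X).fit()
print(model.summary())
```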

The third core function of data analytics platform 114 is machine learning. Machine learning models present privacy risk because the models can memorize sensitive data when they are trained. To mitigate this risk, the system enables analysts to develop and evaluate machine learning models without ever having direct access to the underlying protected health information. FIG. 19 demonstrates how analysts may develop machine learning models using the machine learning API associated with data analytics platform 114. The researcher sets up Keras as the tool by which the model will be developed, defines the model with a template, and instructs the data analytics platform 114 to train the model using the desired data, focusing on the health issue underlying the researcher's particular hypothesis. The output includes a graph providing sensitivity and specificity analytics with the area under the receiver operating characteristic (ROC) curve. The researcher may evaluate the model with standard regression and classification metrics, including the use of confusion table metrics, but cannot retrieve the actual trained model or its coefficients, as those may reveal private data.
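A sketch of what such a model definition might look like is provided below. The Keras layers are standard, while the platform client calls (shown as comments) are hypothetical names used only to illustrate that training occurs remotely and that only status and summary metrics are returned to the researcher.

```python
from tensorflow import keras

# Standard Keras model template defined locally by the researcher.
def build_model(num_features):
    model = keras.Sequential([
        keras.layers.Input(shape=(num_features,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),   # e.g., mortality risk
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
    return model

# Hypothetical platform API: the model template is submitted for remote
# training inside the enclave; only status and summary metrics come back.
# (These calls are illustrative assumptions, not an actual SDK.)
# job = platform.train(build_model, view="covid_encounters", target="died", epochs=10)
# print(job.status, job.metrics["auc"])   # e.g., ROC AUC; the trained model itself is not returned
```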

The data analytics platform 114 has a number of parameters that must be configured prior to use. These ensure effective protection of PHI (or, in non-medical applications, other types of private information). In one illustrative example, the settings are summarized in the chart of FIG. 20. In this example, the re-identification risk threshold is set at a conservative value of 0.05. The overall privacy “budget” is set with a per-table epsilon value of 1000, and a per-query epsilon value of 1. The “likelihood constant for privacy attack probability” is a measure of the likelihood of any user who has security access to the system actually engaging in an attempt to defeat privacy safeguards in the system. These issues are each explained more fully below.

The re-identification risk threshold refers to the maximum accepted risk that an individual could be re-identified in the dataset. A risk value less than the threshold is considered a “very small” risk. There are many statistical methodologies for assessing re-identification risk that measure different types of privacy risk. One methodology for assessing re-identification risk focuses on quantifying what acceptable risk values are for data released using HIPAA Safe Harbor, and demonstrating that an expert determination method presents less re-identification risk to patients than the Safe Harbor approach. In one survey of re-identification risk under Safe Harbor, researchers analyzed what percentage of unique disclosures are permissible in data released by Safe Harbor. They found that it is acceptable under Safe Harbor for individuals to have 4% uniqueness, indicating groups containing only four individuals can be reported in de-identified data.

Another methodology focuses on closely interpreting guidance from the Department of Health and Human Services (HHS). HHS hosts guidance from the Centers for Medicare and Medicaid Services (CMS) on its website. The guidance states that “no cell (e.g. admissions, discharges, patients, services, etc.) containing a value of one to ten can be reported directly.” This guidance has been interpreted by the European Medicines Agency (EMA) to mean that only groups containing eleven or more individuals can be reported in de-identified data, which has led to the adoption by some of 0.09 (or 1/11) as the maximum acceptable risk presented to a single individual. This is not to say that the EMA interpretation applies to the US, but it is not irrelevant to judging what an acceptable amount of privacy risk is for de-identified data. Similar methodologies may be applied under different regulatory schemes.

Both of the above methodologies measure the maximum risk of unique disclosure associated with individual records. Another methodology is to measure the average re-identification risk across all individual records within an entire de-identified database. For both approaches, average risk and maximum risk, conservative risk thresholds are less than 0.1. One study found that the Safe Harbor method of de-identification leads to a risk of about 0.04, meaning that roughly 4% of patients in a Safe Harbor de-identified database are re-identifiable. The high end of tolerable risk is closer to 0.5, due to an interpretation of HIPAA that says the requirement is to not uniquely disclose individuals, and hence groups as small as two can be disclosed in a data release (½=0.5).

An important consideration when determining the acceptable risk of a system is the intended recipient of the data. In cases where the recipient pool is more tightly controlled, it is considered acceptable to have a higher risk threshold; in contrast, systems that expose data to the general public should err towards less risk. In the embodiments of the present invention described herein, a conservative risk threshold of 0.05 was considered appropriate notwithstanding that the system is not producing de-identified information for the general public's consumption.

An important property of the differential privacy system of data analytics platform 114 is its ability to track cumulative privacy loss over a series of statistical disclosures. This ability is referred to as composition. Composition allows the system to track total privacy loss for a particular database. This bound on total privacy loss is referred to as the privacy budget, and the tools for tracking it are referred to as privacy accounting, which is based in the rigorous mathematics of differential privacy composition techniques. A privacy budget is illustrated by example in FIG. 21.

The privacy budget is defined by a positive number called epsilon (ε). Each table in a dataset is assigned an epsilon value (“per table epsilon” in FIG. 20), and each query issued to the dataset must have an epsilon value specified as well (“per query epsilon” in FIG. 20). These epsilon values control the amount of noise that is added to the results of queries. The higher a query's ε parameter, the more accurate the results will be but the lower the privacy guarantees will be, and vice versa. In the illustrative embodiment, a per table epsilon of 1000 and a per query epsilon of 1 are selected for the system. As explained below, these configuration values reduce privacy risk to below the chosen threshold of 0.05.
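
A short sketch of the accuracy-versus-privacy trade-off follows. The true count, the epsilon values, and the use of the Laplace mechanism are illustrative assumptions, not a description of the platform's internal mechanism.

    # Illustrative sketch: how per-query epsilon trades accuracy against privacy
    # for a counting query protected by the Laplace mechanism.
    import numpy as np

    def noisy_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
        """Return a differentially private count using Laplace noise."""
        scale = sensitivity / epsilon          # noise scale grows as epsilon shrinks
        return true_count + np.random.laplace(loc=0.0, scale=scale)

    true_count = 128
    for eps in (0.1, 1.0, 10.0, 1000.0):
        samples = [noisy_count(true_count, eps) for _ in range(1000)]
        print(f"epsilon={eps:7.1f}  mean={np.mean(samples):8.2f}  std={np.std(samples):8.2f}")
    # Larger epsilon yields answers tightly clustered around the true count
    # (better utility, weaker privacy); smaller epsilon yields much noisier answers.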

When de-identifying data via differential privacy, it is necessary to determine the “scope” of the budget and the conditions under which it can be reset. There are three options for the scope of the budget. A “global” budget is a single budget for all data usage. A “project” budget is a separate budget for each project. A “user” budget is a separate budget for each user or researcher. A key consideration for selecting the budget is determining whether or not collusion between users is expected in order to attempt exfiltration of private data. The system deployment for the illustrated embodiment uses project-level budget tracking, as it is unreasonable to expect that multiple researchers at institutions with data use agreements in place will collude with one another to launch privacy attacks against the system. Furthermore, the system permits budgets to be reset for a project in the case that researcher activity logs are audited and there is no evidence of suspicious activity, indicating benign use of the system, and also in the case that a new version of the data is released.

Following the privacy budget illustrated in FIG. 21, the budget begins in this example with ε=100. A first query has ε=0.1, so the remaining budget is then 99.9. A second query also has ε=0.1, and so on, until the budget is reduced to 0.3. At that point the researcher runs a query with ε=0.5. Since this value would exceed the remaining privacy budget, the query will be blocked by data analytics platform 114. Data analytics platform 114 includes a memory or hardware register for tracking this value.
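
A minimal accountant reproducing this behavior is sketched below, assuming basic (sequential) composition in which per-query epsilons simply add. The class and method names are hypothetical and are not part of the platform's API.

    # Minimal sketch of a project-level privacy accountant under sequential composition.
    class PrivacyBudgetExceeded(Exception):
        pass

    class ProjectPrivacyAccountant:
        def __init__(self, total_epsilon: float):
            self.total_epsilon = total_epsilon
            self.spent = 0.0

        @property
        def remaining(self) -> float:
            return self.total_epsilon - self.spent

        def charge(self, query_epsilon: float) -> None:
            """Record the privacy cost of a query, or block it if over budget."""
            if query_epsilon > self.remaining:
                raise PrivacyBudgetExceeded(
                    f"query epsilon {query_epsilon} exceeds remaining budget {self.remaining:.1f}")
            self.spent += query_epsilon

        def reset(self) -> None:
            """Reset the budget, e.g. after a benign audit or a new data version."""
            self.spent = 0.0

    # Re-creating the FIG. 21 example: a budget of 100 consumed in 0.1 increments.
    accountant = ProjectPrivacyAccountant(total_epsilon=100.0)
    for _ in range(997):
        accountant.charge(0.1)             # remaining budget drops toward 0.3
    try:
        accountant.charge(0.5)             # exceeds the remaining budget and is blocked
    except PrivacyBudgetExceeded as exc:
        print("blocked:", exc)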

As described above, analysts are never directly exposed to protected health information. For this reason, inadvertent re-identification of patients (e.g., an analyst recognizes a neighbor or a former patient) is not a reasonable threat to the system. Hence, intentional attacks designed to exfiltrate private information are treated as the only legitimate threat vector, and the system is designed to thwart them. In statistical disclosure control, a simple metric for assessing the privacy risk from intentional attack is the probability of a successful privacy attack (Pr(success)), which follows the formula:

Pr(success)=Pr(success|attempt)*Pr(attempt)

where Pr(attempt) is the probability of an attack and Pr(success|attempt) is the probability an attack will be successful if it is attempted. This metric is employed to quantify privacy risk in the system.

The value of Pr(attempt) must be determined via expert opinion, and it is considered a best practice to err on the side of conservative assumptions when establishing it. The two dimensions to consider when estimating Pr(attempt) are who is attacking (i.e., adversary modelling) and what mitigating controls are in place. The remainder of this section will describe the adversary model that presents the most significant risk of attempted privacy attack, summarize the mitigating controls in the system, and present the determined Pr(attempt) given consideration of these factors.

As mentioned previously, access to the system in the illustrated embodiment is not publicly available: all users of the system will be approved through an internal review board and will only be granted access to the system for legitimate medical research. (In other non-medical applications, of course, different safeguards may be employed in alternative embodiments.) Due to the vetting process, a sophisticated attack on the system from an authenticated user is not a reasonable threat. Nevertheless, to establish conservative privacy risk assumptions for the data analytics platform 114, it has been evaluated with the sophisticated attacks most likely to be executed by privacy researchers. Privacy researchers are the focus because a survey of privacy attacks found that the majority of attacks are attempted by privacy researchers. The motive for privacy researchers is to publish compelling research findings related to privacy vulnerabilities, rather than to use the system for its intended research purposes.

To mitigate the probability that a researcher would attempt an attack, in the illustrated embodiment researchers must be affiliated with an institution that has a data use agreement in place with the data provider. The agreement imposes an explicit prohibition on re-identification, so any researcher attempting an attack must do so in knowing or misguided violation of a legal agreement. All researcher interactions are logged and audited by the data provider periodically to verify that system usage aligns with the intended research study's goals. Privacy attacks have distinct and recognizable patterns, such as random number generators used to manipulate SQL statements, hyper-specific query filters, and rapid execution of queries with slight modifications. These types of behaviors can be easily spotted by administrators. Lastly, researchers are only provisioned access for the duration of a study, so the risk of the data being used outside of the context of a study is mitigated as well.

Given the extensive controls on the dataset and the fact that researchers would need to be misguided in order to attempt re-identification of patient data, the system relies upon the estimate that fewer than 1 in 100 researchers provisioned access to the system would attempt a re-identification attack (<1%). In accordance with best practices, a conservative correction multiple of 10× is applied to the Pr(attempt) value, for a final value of 0.10. Of course, other values could be employed in alternative embodiments of the present invention.
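
The worked arithmetic below ties the Pr(attempt) figure to the overall risk formula. The Pr(success|attempt) value shown is a hypothetical placeholder; the empirically measured values are presented later in this section.

    # Worked example of Pr(success) = Pr(success|attempt) * Pr(attempt),
    # using the figures stated above; Pr(success|attempt) is a placeholder.
    pr_attempt_estimate = 0.01        # < 1 in 100 vetted researchers expected to attack
    conservative_multiple = 10        # best-practice conservative correction
    pr_attempt = pr_attempt_estimate * conservative_multiple      # = 0.10

    pr_success_given_attempt = 0.3    # hypothetical empirical attack success rate
    pr_success = pr_success_given_attempt * pr_attempt            # = 0.03

    risk_threshold = 0.05
    print(f"Pr(attempt) = {pr_attempt:.2f}")
    print(f"Pr(success) = {pr_success:.3f}  (acceptable: {pr_success < risk_threshold})")
    # With Pr(attempt) fixed at 0.10, any measured Pr(success|attempt) below 0.5
    # keeps the overall Pr(success) under the 0.05 threshold.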

When patient data are de-identified using methods such as randomization or generalization, there is a one-to-one mapping between the de-identified data and the underlying data from which they were derived. It is this property that renders this type of data “person-level.” In the system described herein, there is no explicit one-to-one mapping between de-identified data and the underlying data from which they are derived. Instead, analysts are only exposed to aggregate data through the data analytics platform 114, never person-level data.

In the privacy evaluations considered in evaluating the systems and methods described herein, three measures of privacy are used: membership disclosure, attribute disclosure, and identity disclosure. Membership disclosure occurs when an attacker can determine that a dataset includes a record from a specific patient. Membership disclosure for the present system occurs when a powerful attacker, one who already possesses the complete records of a set of patients P, can determine whether anyone from P is in the dataset by observing patterns in the outputs from queries, synthetic data, and/or machine learning models. The knowledge gained by the attacker may be limited if the dataset is well balanced in its clinical concepts. In other embodiments, the knowledge gained would be limited if the dataset is well balanced in other attributes about the subjects in the data.

Attribute disclosure occurs when an attacker can derive additional attributes, such as diagnoses and medications, about a patient based on a subset of attributes already known to the attacker. Attribute disclosure is a more relevant threat than membership disclosure because the attacker only needs to know a subset of attributes of a patient.

Identity disclosure occurs when an attacker can link a patient to a specific entry in the database. Due to the direct linkage of a patient to a record, the attacker will learn all sensitive information contained within the record pertaining to the patient.

The privacy risk of the system's SQL and synthetic data functions is evaluated independently for each function. The privacy risk of the machine learning capability need not be evaluated because ML models are never returned to the user and as such do not represent a privacy risk. It is theoretically possible to use information gained from attacking one core function to inform an attack on a different core function. For example, it is theoretically plausible to use information learned from synthetic data to inform an attack on the SQL query system. However, there are no known attacks that accomplish this goal, and it would require a high degree of sophistication, time, and resources to develop one. For this reason, such attacks are considered an unreasonable threat to the system and are excluded from establishing Pr(success|attempt).

To assess the privacy risk to the data provider dataset, the system empirically evaluates the dataset's privacy risk within the system. In the remainder of this section, the privacy risk evaluations for the query engine and synthetic dataset generator are provided. The privacy risk stemming from machine learning models is not evaluated because users are unable to retrieve and view model data, only evaluation metrics. The empirical results presented in the following sections use specific experimental setups, but technical properties of the system's privacy mechanisms cause the results to be highly generalizable and hence an accurate and representative assessment of privacy risk for the dataset.

As mentioned above, the data analytics platform 114 does not permit users to view person-level data. As a result, membership disclosures can only occur by revealing unique attributes about a patient in the database. Hence, attribute disclosure is evaluated as the primary threat vector for the system's query engine. As mentioned above, this property of the system means that it not only protects against unique disclosure of patients, but also maintains the confidentiality of their attributes. To quantitatively establish the risk of attribute disclosure, three types of attacks are performed on the system: differencing, averaging, and reconstruction. The reconstruction attack is the most sophisticated and powerful of the attacks, so it is used as the basis for establishing a conservative upper bound on the risk presented by queries in the system, and is compared to the previously identified re-identification risk threshold of 0.05.

Differencing attacks aim to single out an individual in a dataset and discover the value of one or more specific attributes. The attack is carried out by running aggregate queries on the target attribute and dataset both with and without the target individual. By taking the difference between the query result with and without that individual, the attacker attempts to derive the value of the target attribute, despite the aggregate answer returned by the system.

FIG. 22 shows the results of a differencing attack in a system without differential privacy. As can be seen, the attacker is able to derive a correct length of stay for an individual patient using only four lines of code. FIG. 23 illustrates an attempt of the same attack against certain embodiments of the present invention. The differential privacy functionality prevents this attack from succeeding. The result achieved by the attacker is a length of stay of about 512 days (an unreasonably long and obviously incorrect result), while the correct answer is 7.26 days. The most important aspect of configuring the differential privacy mechanisms in the system is setting the privacy budget. The chart of FIG. 24 illustrates the results of this attack at varying values of per-query epsilon. The resultant values differ, but all of the values are considered to have resulted in an unsuccessful attack due to the high standard deviation values.
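
The following is a minimal sketch of a differencing attack against an unprotected aggregate-only interface (the scenario of FIG. 22); the toy data and query helper are hypothetical, and only the attack logic (the difference of two sums) follows the description above.

    # Illustrative sketch: differencing attack on an interface without differential privacy.
    lengths_of_stay = {"p01": 7.26, "p02": 3.0, "p03": 12.5, "p04": 5.0}   # toy data

    def sum_query(exclude_patient=None) -> float:
        """Aggregate SUM(length_of_stay), optionally excluding one patient."""
        return sum(v for k, v in lengths_of_stay.items() if k != exclude_patient)

    # Two aggregate queries suffice to recover an individual value exactly
    # when no noise is added to the answers.
    target = "p01"
    recovered = sum_query() - sum_query(exclude_patient=target)
    print(f"recovered length of stay for {target}: {recovered:.2f}")       # ~7.26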

Because the noise addition from the differential privacy mechanisms introduces randomness, one may evaluate the results via simulation. The simulation process runs the differencing attack one hundred times at each epsilon level and allows the differential privacy mechanism to calculate noise independently at each iteration. The attack result at each iteration is recorded and plotted in FIG. 25 for a particular example. The mean and standard deviation of the simulations at each per-query epsilon are recorded in the chart of FIG. 24. It may be seen that for a per-query epsilon of 1.0 to 10.0, the derived values of the differencing attack are useless to an attacker (i.e., the true answer is outside of one standard deviation of the mean attack result). At values of 100.0 to 1000.0, the mean attack result is far closer to the true answer, but still provides an uncertain, inconclusive result to the attacker.
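
A sketch of such a simulation follows, assuming the Laplace mechanism with a sensitivity bound on one patient's contribution; the epsilon levels, iteration count, and toy values are illustrative and do not reproduce the figures in the specification.

    # Sketch of the differencing-attack simulation at varying per-query epsilon.
    import numpy as np

    true_total, true_target = 27.76, 7.26     # toy values reused from the sketch above
    sensitivity = 30.0                        # assumed bound on one patient's contribution

    def noisy(value: float, epsilon: float) -> float:
        return value + np.random.laplace(scale=sensitivity / epsilon)

    n_runs = 100
    for eps in (0.1, 1.0, 10.0, 100.0, 1000.0):
        results = [noisy(true_total, eps) - noisy(true_total - true_target, eps)
                   for _ in range(n_runs)]
        mean, std = np.mean(results), np.std(results)
        within_one_std = abs(mean - true_target) < std    # criterion discussed in the text
        print(f"eps={eps:7.1f}  mean={mean:8.2f}  std={std:8.2f}  "
              f"true answer within one std: {within_one_std}")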

An averaging attack is designed specifically to target systems that return noise-protected results. The attacker runs a single query many times and simply averages the results. The data analytics platform 114 protects against averaging attacks by “caching” queries, or ensuring that the exact same noisy result is provided each time the same query is run. The data analytics platform 114 maintains memory and other storage media for this purpose. By providing the exact same result every time the query is run, data analytics platform 114 does not provide the attacker with a distribution of results to average. As illustrated by FIG. 26, only a few lines of code are required to mount an averaging attack to determine length of stay for a patient. In a system without caching, as illustrated in this figure, it may be seen that the attacker can successfully defeat the noise added to the system in this simple way.
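
A minimal sketch of the caching defense is shown below. The cache key and the use of the Laplace mechanism are illustrative assumptions; the point is simply that repeated executions of an identical query return the identical noisy answer, so averaging gains nothing.

    # Sketch of a query engine that caches its noisy answers per distinct query.
    import hashlib
    import numpy as np

    class CachedNoisyQueryEngine:
        def __init__(self, epsilon: float, sensitivity: float = 1.0):
            self.scale = sensitivity / epsilon
            self._cache = {}

        def run(self, sql: str, true_answer: float) -> float:
            key = hashlib.sha256(sql.encode()).hexdigest()
            if key not in self._cache:        # noise is drawn only once per distinct query
                self._cache[key] = true_answer + np.random.laplace(scale=self.scale)
            return self._cache[key]

    engine = CachedNoisyQueryEngine(epsilon=1.0)
    answers = [engine.run("SELECT AVG(length_of_stay) FROM encounters", 20.0)
               for _ in range(1000)]
    print(len(set(answers)))   # 1: every repetition returns the same cached noisy value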

To evaluate the system's robustness to averaging attacks, one may simulate the attack in FIG. 26 against the database at varying epsilon levels, but without caching. The results are shown in the chart of FIG. 27. One may observe that the queries with epsilon 0.1 are far off from the true mean, 20.0. At 1.0, the mean is within about 10% of the true value, but with high variance. At a value of 10.0, the mean closely approximates the true mean, and the standard deviations are about 10 times smaller than those of the runs at 1.0. A key observation in the results is that the number of queries does not materially impact the accuracy of the attack result. The biggest indicator of a successful attack result is the query epsilon, not the number of queries used in the attack. FIG. 28 shows the results with caching, as implemented in various embodiments of the present invention. Because caching defeats the use of repeated queries to average toward the result, i.e., the same result will be returned no matter how many times the same query is run, the chart of FIG. 28 only shows the results for the first ten queries.

Reconstruction attacks can lead to a significant privacy breach. They exploit a concept known as the Fundamental Law of Information Recovery, which states that “overly accurate answers to too many questions will destroy privacy in a spectacular way.” Each time a query is answered by a database, it necessarily releases some information about the data subjects in the database. Reconstruction attacks use linear programming to derive a series of equations that can reconstruct an attribute (or even a full database) in its entirety. This process is illustrated graphically in FIG. 29. By running the illustrated queries, the attacker is able to reconstruct individuals with certain attributes to a high degree of accuracy.

The systems and methods according to certain embodiments of the invention employ a sophisticated reconstruction attack as described in Cohen et al., “Linear Program Reconstruction in Practice,” arXiv:1810.05692v2 [cs.CR], 23 Jan. 2019, which is incorporated by reference herein. The attack attempts to fully reconstruct the value of a clinical attribute column for a given range of patient identifiers based on the results of a series of aggregate queries.

It should be noted that researchers accessing data analytics platform 114 do not have authorization to perform filters on identifiers such as patient identifiers. However, a motivated attacker could attempt to single out patients using other means, such as using hyper-specific filter conditions in the researcher's queries. Doing so would be a work-around to approximate a range of patient identifiers. By employing patient identifiers in the reconstruction attack experiments, the system establishes a worst-case scenario estimate of the privacy leakage in the system.

The attack concentrates on a chosen range of one hundred patient identifiers, and each query counts a binarized clinical attribute value across at least thirty-five pseudo-randomly selected patients within the identifier range. The baseline efficacy of the attack was measured by executing the queries against a database that did not offer privacy protection. With 1000 queries, the attack on the unprotected database was able to reconstruct the binarized clinical attribute about the patients with perfect accuracy.
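
A sketch of a linear-program reconstruction attack in the spirit of Cohen et al. (arXiv:1810.05692) is given below using SciPy's linprog. The dataset, query count, and Laplace noise are simulated for illustration; they are not the clinical data or the platform's query engine, and the query count is reduced for brevity.

    # Sketch of linear-program reconstruction of a hidden binarized attribute.
    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    n_patients, n_queries = 100, 300
    secret = rng.integers(0, 2, size=n_patients)          # hidden binarized attribute

    # Each query counts the attribute over >= 35 pseudo-randomly chosen patients.
    A = np.zeros((n_queries, n_patients))
    for i in range(n_queries):
        chosen = rng.choice(n_patients, size=rng.integers(35, 51), replace=False)
        A[i, chosen] = 1.0

    def reconstruct(answers: np.ndarray) -> np.ndarray:
        """Solve min sum(u) s.t. |A x - answers| <= u, 0 <= x <= 1, then round x."""
        m, n = A.shape
        c = np.concatenate([np.zeros(n), np.ones(m)])
        A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
        b_ub = np.concatenate([answers, -answers])
        bounds = [(0, 1)] * n + [(0, None)] * m
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        return np.rint(res.x[:n]).astype(int)

    # Unprotected database: exact answers permit (near-)perfect reconstruction.
    exact = reconstruct(A @ secret)
    print("accuracy without privacy protection:", (exact == secret).mean())

    # Differentially private answers: Laplace noise at a small per-query epsilon
    # leaves the linear program with little usable signal.
    per_query_epsilon = 0.1
    noisy = A @ secret + rng.laplace(scale=1.0 / per_query_epsilon, size=n_queries)
    private = reconstruct(noisy)
    print("accuracy with per-query epsilon 0.1:", (private == secret).mean())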

The same attack was then executed against the clinical dataset in the system, with three different levels of the total differential privacy epsilon budget: 100, 1,000 and 10,000. The entire budget was allocated to the same 1000 queries (1/1000th of the budget to each query) that allowed perfect reconstruction in the case of an unprotected database. The clinical attribute reconstruction precision, recall and accuracy were evaluated with twenty attempts at each budget level, with the resulting distributions illustrated in the graph of FIG. 30. As can be seen, at a per-query epsilon of 10.0, the attack is able to reconstruct the binarized clinical attribute for one hundred patients with near-perfect accuracy. At per-query epsilon values of 0.1 and 1.0, the attacker is unable to derive conclusive results about the attribute. The experiment thus demonstrates that the differential privacy epsilon budget provides an effective means for mitigating the reconstruction attack at per-query epsilon values of 0.1 and 1.0.

FIG. 31 provides a distribution chart illustrating the data of FIG. 30 in a different manner. Patient stays are spread across the x-axis, and the darkness of each vertical bar indicates how often the corresponding stay was predicted to have a positive value for the clinical attribute, with darker color indicating more frequent prediction. The true values of the clinical attribute (ground truth) are shown in the bottom row. The light coloring of the top rows demonstrates the attacker's uncertainty at those levels of per-query epsilon. As can be seen, the epsilon value greatly influences the ability of the attacker to succeed.

While the foregoing analysis provides important confirmation of the effectiveness of the data analytics platform 114 in foiling attacks, it remains to relate the query re-identification risk to regulatory thresholds, such as the applicable HIPAA threshold. The methodology set forth herein measures the probability of a reconstruction attack being successfully executed against the system. As stated previously, the reconstruction attack was chosen as the attack model for establishing estimated re-identification risk because it represents the most sophisticated attack against the system to attempt exfiltration of private information.

The probability of a successful attack is measured by simulating many reconstruction attacks and measuring what percentage of those attacks are successful. For this purpose, a “successful” attack is defined as one that is able to outperform a baseline classifier. The chart of FIG. 32 records the results of these simulations. As represented, attacks against a system configured with per-query epsilons of 0.1 and 1.0 are unsuccessful 100% of the time, and thus the Pr(success) is lower than the chosen threshold of 0.05. At a per-query epsilon of 10.0, the attacks are successful 72% of the time; the Pr(success) thus exceeds the target threshold, and that value is not the chosen configuration for the system.
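
The sketch below shows one way such a success rate could be computed: each simulated attack's reconstruction accuracy is compared against a majority-class baseline, and Pr(success|attempt) is the fraction of attacks that beat it. The per-trial accuracies are hypothetical placeholders, not the measured values reported in FIG. 32.

    # Sketch of the success-rate measurement against a majority-class baseline.
    import numpy as np

    def attack_success_rate(attack_accuracies, secret_bits) -> float:
        """Fraction of simulated attacks that outperform a majority-class baseline."""
        baseline = max(np.mean(secret_bits), 1 - np.mean(secret_bits))   # always >= 0.5
        return float(np.mean([acc > baseline for acc in attack_accuracies]))

    secret_bits = np.random.default_rng(1).integers(0, 2, size=100)

    # Hypothetical per-trial reconstruction accuracies at two configurations.
    accuracies_low_eps  = np.random.uniform(0.35, 0.50, size=20)   # inconclusive attacks
    accuracies_high_eps = np.random.uniform(0.90, 1.00, size=20)   # near-perfect attacks

    pr_attempt = 0.10
    for label, accs in (("low epsilon", accuracies_low_eps), ("high epsilon", accuracies_high_eps)):
        p = attack_success_rate(accs, secret_bits)
        print(f"{label}: Pr(success|attempt)={p:.2f}  Pr(success)={p * pr_attempt:.3f}")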

Two methods are employed to evaluate the privacy risk of the synthetic data function in the system. The first is an attribute inference attack and the second is an identity disclosure risk measurement. Identity disclosure represents the most significant privacy breach in the system, and it is used as the basis for the Pr(success) metric for the synthetic data capability.

For the attribute inference attack, it is assumed that the attacker somehow obtained access to some or all of the original data, but only managed to obtain a subset of features and wishes to infer the missing information, similar to the setting of the query reconstruction attacks described above. Since the attacker also has access to synthetic data, which includes all features, the attacker can attempt to infer the missing values in the original data by using similar records in the synthetic data. This is plausible because the synthetic data is expected to exhibit the same statistical properties as the original data. Results of the attack are shown in the chart of FIG. 33. As can be seen, the attack is largely unsuccessful, regardless of whether the k-NN or random-forest method is chosen. Regularization (i.e., dropout) reduces attack performance.
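
A sketch of the attribute inference attack follows: the attacker's known features are matched to similar synthetic records, and the missing attribute is predicted from those neighbors. The data, feature names, and use of scikit-learn's k-NN classifier are illustrative assumptions.

    # Sketch of an attribute inference attack against a synthetic data release.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(2)

    # Hypothetical synthetic release: known features plus a sensitive attribute.
    synthetic_known = rng.normal(size=(5000, 4))            # e.g., age, weight, visit counts
    synthetic_sensitive = rng.integers(0, 2, size=5000)     # e.g., a diagnosis flag

    # Hypothetical leaked originals: the attacker knows only the same 4 features.
    leaked_known = rng.normal(size=(200, 4))
    true_sensitive = rng.integers(0, 2, size=200)

    attack_model = KNeighborsClassifier(n_neighbors=5)
    attack_model.fit(synthetic_known, synthetic_sensitive)
    inferred = attack_model.predict(leaked_known)

    accuracy = (inferred == true_sensitive).mean()
    baseline = max(true_sensitive.mean(), 1 - true_sensitive.mean())
    print(f"inference accuracy {accuracy:.2f} vs. majority baseline {baseline:.2f}")
    # An attack that materially exceeds the baseline would indicate that the
    # synthetic data preserves individual-level correlations too faithfully.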

To establish the identity disclosure risk exposed by synthetic datasets, a risk metric is employed that functions by rigorously comparing the generated synthetic dataset with the original dataset from which it was derived, and produces a conservative estimate of the identity disclosure risk exposed by a synthetic dataset. The metric considers several factors about the dataset, including the number of records in the derived synthetic dataset that match records in the original dataset, the probability of errors in the original dataset, and the probability that an attacker is able to verify that matched records are accurate. As shown in FIG. 34, the system is able to consistently produce synthetic datasets with an identity disclosure risk far lower than the target of 0.05. The synthetic data is generated from the clinical dataset at a very low regularization level (dropout=0.0) and a high regularization level (0.5). Both are an order of magnitude lower than the target for the upper bound of re-identification risk, due to the probability of a successful attack (i.e., Pr(success)) being below ten percent in both cases.
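
A highly simplified sketch of such a metric is shown below. The matching rule and the discounting factors (record-error probability and verification probability) are illustrative assumptions; the metric actually employed by the platform is more rigorous than this sketch.

    # Simplified sketch of an identity disclosure risk estimate for synthetic data.
    import numpy as np

    def identity_disclosure_risk(original: np.ndarray,
                                 synthetic: np.ndarray,
                                 p_record_error: float = 0.05,
                                 p_verification: float = 0.5) -> float:
        """Fraction of original records exactly matched by some synthetic record,
        discounted by the chance the match is wrong or cannot be verified."""
        synthetic_rows = {tuple(row) for row in synthetic}
        matches = sum(tuple(row) in synthetic_rows for row in original)
        match_rate = matches / len(original)
        return match_rate * (1.0 - p_record_error) * p_verification

    rng = np.random.default_rng(3)
    original = rng.integers(0, 5, size=(1000, 6))     # toy quasi-identifier table
    synthetic = rng.integers(0, 5, size=(1000, 6))    # toy synthetic release

    risk = identity_disclosure_risk(original, synthetic)
    print(f"estimated identity disclosure risk: {risk:.4f}  (target < 0.05)")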

This expert determination as described herein relies on empirical results to quantify the privacy risk associated with usage of the system. It is not possible to perform every conceivable experiment on the target dataset before releasing it to users. Hence, one must consider the generalizability of the observed empirical results. The empirical results are a strong representation of the overall privacy risk in the system for two reasons.

The first is differential privacy's concept of sensitivity. This property of the technology adjusts the noise added to statistical results based on the re-identification risk presented per query. This means that the system is not configuring total noise addition, but rather total privacy loss. Hence, the privacy risk will remain approximately constant across different datasets and queries.
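
The sketch below illustrates sensitivity-calibrated noise, assuming the Laplace mechanism: because the noise scale is sensitivity divided by epsilon, queries that could reveal more about a single record automatically receive more noise while the privacy loss (epsilon) stays fixed. The clamping bound of 90 days is a hypothetical example.

    # Sketch of sensitivity-calibrated noise under the Laplace mechanism.
    import numpy as np

    def laplace_mechanism(true_answer: float, sensitivity: float, epsilon: float) -> float:
        return true_answer + np.random.laplace(scale=sensitivity / epsilon)

    epsilon = 1.0
    # A COUNT query changes by at most 1 when one record is added or removed.
    print(laplace_mechanism(true_answer=500, sensitivity=1.0, epsilon=epsilon))
    # A SUM over length of stay (assumed clamped to 0..90 days) changes by up to 90,
    # so it receives proportionally more noise for the same epsilon.
    print(laplace_mechanism(true_answer=3650, sensitivity=90.0, epsilon=epsilon))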

Second, the empirical evaluations described in certain embodiments of the invention set forth herein are considered conservative: they employ attacks far more sophisticated than what one might reasonably expect would be launched against the system. Furthermore, the system in certain embodiments adopts a conservative risk threshold of 0.05, which is as much as ten times lower than the thresholds used by other systems for de-identifying health data. For these reasons, it is believed unreasonable to expect that the observed re-identification risk for a different (but similar) clinical dataset would grossly exceed the reported re-identification risk values set forth herein.

Another issue is dataset growth over time. A typical clinical dataset grows continuously as patient encounters are persisted to its data warehouse. Each additional patient encounter can contribute meaningful information to researchers. New patient encounters may be added to the dataset on a “batch” basis, with a target of adding a new batch of patient data every one to three months. Each of these incremental datasets will become a new “version” of the clinical dataset. It is necessary to evaluate to what extent the re-identification risk of one version is representative of the re-identification risk of a successive version. This should be considered in the context of both queries and synthetic data. The generalization properties of the differential privacy system, as described above, mean that queries in the system are expected to produce approximately the same re-identification risk across each version of the dataset. Regarding synthetic datasets, because the re-identification risk is measured for each dataset dynamically, the system dynamically enforces the re-identification risk to be within the established targets of the generated reports.

It should be noted that there are multiple points of information disclosure mentioned in the above workflow. These disclosures could include the release of information from the dataset to external researchers as well as the publication of findings. The privacy guarantees of differential privacy are sensitive to information disclosure, meaning that as more information is disclosed about the protected dataset (even just aggregate, non-identifiable data), the privacy guarantees afforded by differential privacy are weakened. If enough information were released, an attacker could use that information in building attacks against the database.

For example, an external researcher could ask the following queries and get the following answers:

Q1=COUNT(X AND Y)

R1=16

Q2=COUNT(X AND Y AND Z)

R2=14

Both R1 and R2 are differentially private. This suggests that COUNT(X AND Y AND NOT Z) would be 2. Next, imagine that a paper publishes the true number of X and Y:

T1=COUNT(X AND Y)

P1=12

Since COUNT(X AND Y) and COUNT(X AND Y AND Z) are correlated, the external researcher learns that the true value underlying R2 cannot be 14. What the external researcher now knows is that the true value of COUNT(X AND Y AND Z) lies somewhere in the range [0, 12]. In this example, if X=(gender=male), Y=(sex=woman), and Z=(age=22), the publication of the non-differentially private results would have contributed to information gain for an attacker without requiring extensive and sophisticated attacks.
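
The short sketch below restates this reasoning with the counts used in the example; the only logic applied is that a conjunction with an additional predicate can never exceed the count of its parent query.

    # Sketch of the auxiliary-information reasoning above.
    noisy_count_xy    = 16   # R1: differentially private answer to COUNT(X AND Y)
    noisy_count_xyz   = 14   # R2: differentially private answer to COUNT(X AND Y AND Z)
    published_true_xy = 12   # P1: exact COUNT(X AND Y) published without privacy protection

    # After the publication, COUNT(X AND Y AND Z) is known to lie in [0, published_true_xy].
    upper_bound = published_true_xy
    print(f"true COUNT(X AND Y AND Z) must lie in [0, {upper_bound}]; "
          f"the noisy answer {noisy_count_xyz} is therefore known to overstate it.")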

While it may be acknowledged that the periodic publication of non-private aggregate statistics about the dataset can potentially weaken the overall privacy guarantees the system provides to external researchers, these types of privacy attacks are not a reasonable threat for two reasons. The first is the complexity of multiple versions of data being created throughout the project lifecycle, combined with the fact that external researchers only have access to the system during the course of the study. The result is that it is unlikely an external researcher will have access to the exact version of data at the same time that published, non-private results are available for such researcher to launch a privacy attack. Secondly, in certain embodiments the users of the system are vetted by research professionals and are under a data use agreement, which prohibits attempts at re-identification.

The systems and methods described herein may in various embodiments be implemented by any combination of hardware and software. For example, in one embodiment, the systems and methods may be implemented by a computer system or a collection of computer systems, each of which includes one or more processors executing program instructions stored on a computer-readable storage medium coupled to the processors. The program instructions may implement the functionality described herein. The various systems and displays as illustrated in the figures and described herein represent example implementations. The order of any method may be changed, and various elements may be added, modified, or omitted.

A computing system or computing device as described herein may implementa hardware portion of a cloud computing system or non-cloud computingsystem, as forming parts of the various implementations of the presentinvention. The computer system may be any of various types of devices,including, but not limited to, a commodity server, personal computersystem, desktop computer, laptop or notebook computer, mainframecomputer system, handheld computer, workstation, network computer, aconsumer device, application server, storage device, telephone, mobiletelephone, or in general any type of computing node, compute node,compute device, and/or computing device. The computing system includesone or more processors (any of which may include multiple processingcores, which may be single or multi-threaded) coupled to a system memoryvia an input/output (I/O) interface. The computer system further mayinclude a network interface coupled to the I/O interface.

In various embodiments, the computer system may be a single processorsystem including one processor, or a multiprocessor system includingmultiple processors. The processors may be any suitable processorscapable of executing computing instructions. For example, in variousembodiments, they may be general-purpose or embedded processorsimplementing any of a variety of instruction set architectures. Inmultiprocessor systems, each of the processors may commonly, but notnecessarily, implement the same instruction set. The computer systemalso includes one or more network communication devices (e.g., a networkinterface) for communicating with other systems and/or components over acommunications network, such as a local area network, wide area network,or the Internet. For example, a client application executing on thecomputing device may use a network interface to communicate with aserver application executing on a single server or on a cluster ofservers that implement one or more of the components of the systemsdescribed herein in a cloud computing or non-cloud computing environmentas implemented in various sub-systems. In another example, an instanceof a server application executing on a computer system may use a networkinterface to communicate with other instances of an application that maybe implemented on other computer systems.

The computing device also includes one or more persistent storagedevices and/or one or more I/O devices. In various embodiments, thepersistent storage devices may correspond to disk drives, tape drives,solid state memory, other mass storage devices, or any other persistentstorage devices. The computer system (or a distributed application oroperating system operating thereon) may store instructions and/or datain persistent storage devices, as desired, and may retrieve the storedinstruction and/or data as needed. For example, in some embodiments, thecomputer system may implement one or more nodes of a control plane orcontrol system, and persistent storage may include the SSDs attached tothat server node. Multiple computer systems may share the samepersistent storage devices or may share a pool of persistent storagedevices, with the devices in the pool representing the same or differentstorage technologies.

The computer system includes one or more system memories that may storecode/instructions and data accessible by the processor(s). The system'smemory capabilities may include multiple levels of memory and memorycaches in a system designed to swap information in memories based onaccess speed, for example. The interleaving and swapping may extend topersistent storage in a virtual memory implementation. The technologiesused to implement the memories may include, by way of example, staticrandom-access memory (RAM), dynamic RAM, read-only memory (ROM),non-volatile memory, or flash-type memory. As with persistent storage,multiple computer systems may share the same system memories or mayshare a pool of system memories. System memory or memories may containprogram instructions that are executable by the processor(s) toimplement the routines described herein. In various embodiments, programinstructions may be encoded in binary, Assembly language, anyinterpreted language such as Java, compiled languages such as C/C++, orin any combination thereof; the particular languages given here are onlyexamples. In some embodiments, program instructions may implementmultiple separate clients, server nodes, and/or other components.

In some implementations, program instructions may include instructionsexecutable to implement an operating system (not shown), which may beany of various operating systems, such as UNIX, LINUX, Solaris™, MacOS™,or Microsoft Windows™. Any or all of program instructions may beprovided as a computer program product, or software, that may include anon-transitory computer-readable storage medium having stored thereoninstructions, which may be used to program a computer system (or otherelectronic devices) to perform a process according to variousimplementations. A non-transitory computer-readable storage medium mayinclude any mechanism for storing information in a form (e.g., software,processing application) readable by a machine (e.g., a computer).Generally speaking, a non-transitory computer-accessible medium mayinclude computer-readable storage media or memory media such as magneticor optical media, e.g., disk or DVD/CD-ROM coupled to the computersystem via the I/O interface. A non-transitory computer-readable storagemedium may also include any volatile or non-volatile media such as RAMor ROM that may be included in some embodiments of the computer systemas system memory or another type of memory. In other implementations,program instructions may be communicated using optical, acoustical orother form of propagated signal (e.g., carrier waves, infrared signals,digital signals, etc.) conveyed via a communication medium such as anetwork and/or a wired or wireless link, such as may be implemented viaa network interface. A network interface may be used to interface withother devices, which may include other computer systems or any type ofexternal electronic device. In general, system memory, persistentstorage, and/or remote storage accessible on other devices through anetwork may store data blocks, replicas of data blocks, metadataassociated with data blocks and/or their state, database configurationinformation, and/or any other information usable in implementing theroutines described herein.

In certain implementations, the I/O interface may coordinate I/O trafficbetween processors, system memory, and any peripheral devices in thesystem, including through a network interface or other peripheralinterfaces. In some embodiments, the I/O interface may perform anynecessary protocol, timing or other data transformations to convert datasignals from one component (e.g., system memory) into a format suitablefor use by another component (e.g., processors). In some embodiments,the I/O interface may include support for devices attached throughvarious types of peripheral buses, such as a variant of the PeripheralComponent Interconnect (PCI) bus standard or the Universal Serial Bus(USB) standard, for example. Also, in some embodiments, some or all ofthe functionality of the I/O interface, such as an interface to systemmemory, may be incorporated directly into the processor(s).

A network interface may allow data to be exchanged between a computersystem and other devices attached to a network, such as other computersystems (which may implement one or more storage system server nodes,primary nodes, read-only node nodes, and/or clients of the databasesystems described herein), for example. In addition, the I/O interfacemay allow communication between the computer system and various I/Odevices and/or remote storage. Input/output devices may, in someembodiments, include one or more display terminals, keyboards, keypads,touchpads, scanning devices, voice or optical recognition devices, orany other devices suitable for entering or retrieving data by one ormore computer systems. These may connect directly to a particularcomputer system or generally connect to multiple computer systems in acloud computing environment, grid computing environment, or other systeminvolving multiple computer systems. Multiple input/output devices maybe present in communication with the computer system or may bedistributed on various nodes of a distributed system that includes thecomputer system. The user interfaces described herein may be visible toa user using various types of display screens, which may include CRTdisplays, LCD displays, LED displays, and other display technologies. Insome implementations, the inputs may be received through the displaysusing touchscreen technologies, and in other implementations the inputsmay be received through a keyboard, mouse, touchpad, or other inputtechnologies, or any combination of these technologies.

In some embodiments, similar input/output devices may be separate fromthe computer system and may interact with one or more nodes of adistributed system that includes the computer system through a wired orwireless connection, such as over a network interface. The networkinterface may commonly support one or more wireless networking protocols(e.g., Wi-Fi/IEEE 802.11, or another wireless networking standard). Thenetwork interface may support communication via any suitable wired orwireless general data networks, such as other types of Ethernetnetworks, for example. Additionally, the network interface may supportcommunication via telecommunications/telephony networks such as analogvoice networks or digital fiber communications networks, via storagearea networks such as Fibre Channel SANs, or via any other suitable typeof network and/or protocol.

Any of the distributed system embodiments described herein, or any oftheir components, may be implemented as one or more network-basedservices in the cloud computing environment. For example, a read-writenode and/or read-only nodes within the database tier of a databasesystem may present database services and/or other types of data storageservices that employ the distributed storage systems described herein toclients as network-based services. In some embodiments, a network-basedservice may be implemented by a software and/or hardware system designedto support interoperable machine-to-machine interaction over a network.A web service may have an interface described in a machine-processableformat, such as the Web Services Description Language (WSDL). Othersystems may interact with the network-based service in a mannerprescribed by the description of the network-based service's interface.For example, the network-based service may define various operationsthat other systems may invoke, and may define a particular applicationprogramming interface (API) to which other systems may be expected toconform when requesting the various operations.

In various embodiments, a network-based service may be requested orinvoked through the use of a message that includes parameters and/ordata associated with the network-based services request. Such a messagemay be formatted according to a particular markup language such asExtensible Markup Language (XML), and/or may be encapsulated using aprotocol such as Simple Object Access Protocol (SOAP). To perform anetwork-based services request, a network-based services client mayassemble a message including the request and convey the message to anaddressable endpoint (e.g., a Uniform Resource Locator (URL))corresponding to the web service, using an Internet-based applicationlayer transfer protocol such as Hypertext Transfer Protocol (HTTP). Insome embodiments, network-based services may be implemented usingRepresentational State Transfer (REST) techniques rather thanmessage-based techniques. For example, a network-based serviceimplemented according to a REST technique may be invoked throughparameters included within an HTTP method such as PUT, GET, or DELETE.

Unless otherwise stated, all technical and scientific terms used hereinhave the same meaning as commonly understood by one of ordinary skill inthe art to which this invention belongs. Although any methods andmaterials similar or equivalent to those described herein can also beused in the practice or testing of the present invention, a limitednumber of the exemplary methods and materials are described herein. Itwill be apparent to those skilled in the art that many moremodifications are possible without departing from the inventive conceptsherein.

All terms used herein should be interpreted in the broadest possiblemanner consistent with the context. When a grouping is used herein, allindividual members of the group and all combinations andsub-combinations possible of the group are intended to be individuallyincluded. When a range is stated herein, the range is intended toinclude all subranges and individual points within the range. Allreferences cited herein are hereby incorporated by reference to theextent that there is no inconsistency with the disclosure of thisspecification.

The present invention has been described with reference to certainpreferred and alternative embodiments that are intended to be exemplaryonly and not limiting to the full scope of the present invention, as setforth in the appended claims.

1. A platform to support hypothesis testing on private data, wherein the system comprises: at least one processor; a database comprising a private dataset in communication with the at least one processor; at least one non-transitory media in communication with the processor, wherein the non-transitory media comprises an instruction set comprising instructions that, when executed at the at least one processor in communication with the at least one processor, are configured to: receive from an external analytics computer system a request for an operation against the private dataset; calculate an answer to the request from the external analytics computer system; apply noise to the answer from the external analytics computer system to produce a noisy result; return the noisy result to the external analytics computer system; receive from the external analytics computer system a request for hypothesis testing against the private dataset; perform a hypothesis test against the private dataset; calculate a final result from the performance of the hypothesis test against the private dataset, wherein the final result is protective of the privacy of the private dataset; and return the final result to the external analytics computer system.

2. The system of claim 1, wherein the instruction set, when executed at the at least one processor in communication with the at least one processor, is further configured to apply differential privacy to the private dataset to produce the noisy result.
 3. The system of claim 1,wherein the final result is a set of aggregate statistical results. 4.The system of claim 1, wherein the instruction set, when executed at theat least one processor in communication with the at least one processor,is further configured to provide the final result to an internalanalytics computer system, and release the final result to the externalanalytics computer system if an approval is received from the internalanalytics computer system.
 5. The system of claim 4, wherein theinstruction set, when executed at the at least one processor incommunication with the at least one processor, is further configured to:receive a query from the internal analytics computer system; execute thequery against the private dataset; and return a set of true results tothe internal analytics computer system.
 6. The system of claim 1,wherein the instruction set, when executed at the at least one processorin communication with the at least one processor, is further configuredto: receive a machine learning training or evaluation request from theexternal analytics computer system; ingest the machine learning trainingor evaluation request and perform the machine learning task against theprivate dataset; and return a machine learning training or evaluationresult to the external analytics computer system.
 7. The system of claim6, wherein the machine learning training or evaluation result comprisessummary statistics without a machine learning model.
 8. The system ofclaim 4, wherein the instruction set, when executed at the at least oneprocessor in communication with the at least one processor, is furtherconfigured to: receive a machine learning training or evaluation requestfrom an internal analytics computer system; ingest the machine learningtraining or evaluation request and perform the machine learning taskagainst the private dataset; return a set of summary statistics to theinternal analytics computer system; receive a request for the machinelearning model from the internal analytics computer system; and retrieveand return the machine learning model to the internal analytics computersystem.
 9. The system of claim 8, wherein the machine learning trainingor evaluation result comprises summary statistics without a machinelearning model.
 10. The system of claim 1, wherein the instruction set,when executed at the at least one processor in communication with the atleast one processor, is further configured to: receive a request for asynthetic dataset from the external analytics computer system; generatea synthetic dataset from the private dataset; evaluate the privacy ofthe synthetic dataset; and based on the result of the evaluation of theprivacy of the synthetic dataset, return the synthetic dataset to theexternal analytics computer system.
 11. The system of claim 1, whereinthe instruction set, when executed at the at least one processor incommunication with the at least one processor, is further configured to,for each request for an operation against the private dataset from theexternal analytics computer system, calculate a query epsilon budget,apply a per-query epsilon budget, and only return the noisy result tothe external analytics computer system if the per-query epsilon budgethas not been exceeded by the request for an operation.
 12. The system ofclaim 11, wherein the instruction set, when executed at the at least oneprocessor in communication with the at least one processor, is furtherconfigured to maintain a per-project epsilon budget, calculate a queryepsilon for a plurality of requests for an operation against the privatedataset from the external analytics computer, calculate a projectepsilon by summing each of the previous query epsilons, and only returnthe noisy result to the external analytics system if the project epsilondoes not exceed the per-project epsilon budget.
 13. The system of claim12, wherein the instruction set, when executed at the at least oneprocessor in communication with the at least one processor, is furtherconfigured to calculate at least one of a per-query epsilon budget and aproject epsilon budget by simulating a privacy attack against theprivate dataset.
 14. The system of claim 13, wherein the privacy attackis a linear programming reconstruction attack.
 15. The system of claim14, wherein the epsilon is set to correspond to a probability of successfor the privacy attack at no more than 0.05.
 16. A method for testing ahypothesis using private data, the method comprising the steps of: at adata analytics platform, receiving from an external analytics computersystem a request for an operation against a private dataset stored in adatabase connected to the data analytics platform; at the data analyticsplatform, calculating an answer to the request from the externalanalytics computer system; at the data analytics platform, applyingnoise to the answer from the external analytics computer system toproduce a noisy result; returning from the data analytics platform thenoisy result to the external analytics computer system; at the dataanalytics platform, receiving from the external analytics computersystem a request for hypothesis testing against the private dataset; atthe data analytics platform, performing a hypothesis test against theprivate dataset in the database; at the data analytics platform,calculating a final result from the performance of the hypothesis testagainst the private dataset, wherein the final result is protective ofthe privacy of the private dataset; and returning from the dataanalytics platform the final result to the external analytics computersystem.
 17. The method of claim 16, further comprising the step ofapplying differential privacy to the private dataset to produce thenoisy result.
 18. The method of claim 16, wherein the final result is aset of aggregate statistical results.
 19. The method of claim 16,further comprising the steps of: providing the final result from thedata analytics platform to an internal analytics computer system; andreleasing the final result from the data analytics platform to theexternal analytics computer system if an approval is received from theinternal analytics computer system.
 20. The method of claim 19, furthercomprising the steps of: at the data analytics platform, receiving aquery from the internal analytics computer system; at the data analyticsplatform, executing the query against the private dataset; and returningfrom the data analytics platform a set of true results to the internalanalytics computer system.
 21. The method of claim 16, furthercomprising the steps of: receiving at the data analytics platform amachine learning training or evaluation request from the externalanalytics computer system; ingesting the machine learning training orevaluation request at the data analytics platform; at the data analyticsplatform, performing the machine learning task against the privatedataset in the database; and returning from the data analytics platforma machine learning training or evaluation result to the externalanalytics computer system.
 22. The method of claim 21, wherein themachine learning training or evaluation result comprises summarystatistics without a machine learning model.
 23. The method of claim 19,further comprising the steps of: receiving at the data analyticsplatform a machine learning training or evaluation request from aninternal analytics computer system; ingesting at the data analyticsplatform the machine learning training or evaluation request; performingat the data analytics platform the machine learning task against theprivate dataset; returning from the data analytics platform a set ofsummary statistics to the internal analytics computer system; receivingat the data analytics platform a request for the machine learning modelfrom the internal analytics computer system; and at the data analyticsplatform, retrieving and returning the machine learning model to theinternal analytics computer system.
 24. The method of claim 23, whereinthe machine learning training or evaluation result comprises summarystatistics without a machine learning model.
 25. The method of claim 16,further comprising the steps of: at the data analytics platform,receiving a request for a synthetic dataset from the external analyticscomputer system; at the data analytics platform, generating a syntheticdataset from the private dataset in the database; at the data analyticsplatform, evaluating the privacy of the synthetic dataset; and based onthe result of the evaluation of the privacy of the synthetic dataset,sending the synthetic dataset from the data analytics platform to theexternal analytics computer system.
 26. The method of claim 16, furthercomprising the step of, for each request for an operation against theprivate dataset from the external analytics computer system to the dataanalytics platform, calculating a query epsilon budget, applying aper-query epsilon budget, and only returning the noisy result to theexternal analytics computer system if the per-query epsilon budget hasnot been exceeded by the request for an operation.
 27. The method ofclaim 26, further comprising the steps of: maintaining at the dataanalytics platform a per-project epsilon budget; calculating at the dataanalytics platform a query epsilon for a plurality of requests for anoperation against the private dataset from the external analyticscomputer; and calculating at the data analytics platform a projectepsilon by summing each of the previous query epsilons, and onlyreturning the noisy result to the external analytics system if theproject epsilon does not exceed the per-project epsilon budget.
 28. Themethod of claim 27, further comprising the step of calculating at thedata analytics platform at least one of a per-query epsilon budget and aproject epsilon budget by simulating a privacy attack against theprivate dataset.
 29. The method of claim 28, wherein the privacy attackis a linear programming reconstruction attack.
 30. The method of claim29, wherein the epsilon is set to correspond to a probability of successfor the privacy attack at no more than 0.05.